feat(BA-5878): re-supply legacy live_stat stats.{max,avg,rate} from Prometheus#11360
Open
seedspirit wants to merge 11 commits intomainfrom
Open
feat(BA-5878): re-supply legacy live_stat stats.{max,avg,rate} from Prometheus#11360seedspirit wants to merge 11 commits intomainfrom
seedspirit wants to merge 11 commits intomainfrom
Conversation
seedspirit
added a commit
that referenced
this pull request
Apr 28, 2026
seedspirit
added a commit
that referenced
this pull request
Apr 29, 2026
Reflect the new direction: agent-side stats.max / stats.avg are exported as Prometheus value_type labels instead of computing them via window-based PromQL queries. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
seedspirit
added a commit
that referenced
this pull request
Apr 29, 2026
seedspirit
added a commit
that referenced
this pull request
Apr 29, 2026
Reflect the new direction: agent-side stats.max / stats.avg are exported as Prometheus value_type labels instead of computing them via window-based PromQL queries. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
f45c9a0 to
9f6a8ee
Compare
Contributor
There was a problem hiding this comment.
Pull request overview
This PR extends the container utilization Prometheus metric to carry agent-computed rolling window statistics (stats.max / stats.avg) as additional value_type label variants (max / avg), and updates the manager-side live-stat query path to read/merge these additional samples without adding extra PromQL window queries.
Changes:
- Agent: emit
value_type=max/value_type=avgsamples for container metrics whenstats.max/stats.avgare present. - Common/Manager: add
MAX/AVGto the sharedValueTypeenum and update live-stat query merging to handle an extensible query bundle. - Tests/Changelog: add unit tests for live-stat query composition/merging and add a feature note.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
src/ai/backend/agent/stats.py |
Adds stats.max / stats.avg emission as Prometheus value_type samples during container stat collection. |
src/ai/backend/agent/metrics/types.py |
Introduces MAX_METRIC_KEY / AVG_METRIC_KEY and includes them in ALL_METRIC_VALUE_TYPES. |
src/ai/backend/common/clients/prometheus/types.py |
Extends ValueType with MAX / AVG. |
src/ai/backend/manager/repositories/metric/repository.py |
Refactors live-stat query merging to iterate over gathered responses. |
tests/unit/manager/services/utilization_metric/test_container_metric.py |
Adds tests for live-stat query bundle shape and merge behavior with max/avg value types. |
changes/11360.feature.md |
Documents the new agent-side export behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
fregataa
reviewed
Apr 30, 2026
jopemachine
reviewed
May 8, 2026
Revert agent-side stats.max/avg export in favor of computing max/avg via PromQL window expressions on the manager. Container live-stat fan-out grows from 3 to 5 queries (gauge / diff / rate / max / avg). - Re-introduce MAX / AVG to ValueType plus from_legacy_live_stat_label to map "stats.max" / "stats.avg" labels back into typed value types. - Build max/avg templates that union a gauge sub-expression with a rate sub-expression and label_replace the result back to the legacy "stats.max" / "stats.avg" label. - Classify metrics into gauge-shape vs rate-shape stats sources, with exact names for built-ins and regex patterns for accelerator metrics. - Repository merges 5 query responses in a generic loop instead of unpacking three buckets. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PromQL parser rejects \- as an unknown escape sequence inside a regex literal, so re.escape over-escaping broke every container live-stat query (kernel_id UUIDs always contain hyphens). Strip the backslash from \- after escaping so the rendered queries are RE2-acceptable. Also drop the unused ValueType.RATE — no producer ever emits it and no consumer matches on it; only MAX/AVG round-trip to the legacy stats.* labels. And include the underlying exception in the warning emitted from MetricRepository.query_container_live_stats so "empty results" no longer hide the real Prometheus failure mode. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…s_query
The gauge and rate branches both produced the same
label_replace + *_over_time + sum_by skeleton, differing only in
whether the inner selector was wrapped with rate(...[window]). The
two branches embedded that skeleton as inline f-strings, which
hid the symmetry and was hard to read.
Extract two methods on FixedQueryBuilder:
- _utilization_selector() — builds {METRIC}{labels}, hiding the
brace-doubling away from rendering callers.
- _window_stat_subquery() — wraps a selector in
label_replace({stat_fn}((sum by ({group_by})({selector}))[window:]),
"value_type","{label}","value_type",".*"). This is the single
place where the stats-subquery shape lives.
_render_stats_query now reads as: pick a regex, build a selector
(optionally wrapped in rate()), and feed it through the same
skeleton. Rendered output is byte-for-byte unchanged
(characterization tests pass without touching the fixture).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The five fields of ContainerLiveStatQueries were built through three asymmetric idioms: gauge/diff/rate constructed _LiveStatQuerySpec on the fly inside the call, max/avg used pre-built module constants but asked the caller to compute kernel_id_regex/group_by and pass them positionally. Promote the three filtered-query specs to module-level Final constants (_GAUGE_LIVE_STAT_SPEC / _DIFF_LIVE_STAT_SPEC / _RATE_LIVE_STAT_SPEC) so they sit alongside _MAX_STATS_BUCKET / _AVG_STATS_BUCKET, and rename the two builders so all five fields read as self._build_*_preset(kernel_ids, _CONSTANT). The kernel_id_regex / group_by computation moves inside _build_window_stats_preset where it is actually consumed, instead of leaking to the caller. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The match over MetricType already covers all three variants, so the case _ -> raise UnreachableError(...) arm was dead code that Pylance flagged as unreachable. mypy's exhaustiveness check now enforces the same invariant statically — adding a new MetricType variant without a case will fail type check, which is the same protection the runtime raise gave but caught earlier. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The two helpers were named "to_live_stat_label" / "from_legacy_live_stat_label", which obscured their actual asymmetry: only the encoder produces the legacy "stats.<name>" form, while the decoder accepts both legacy and current shapes. Rename them to: - ValueType.to_legacy_live_stat_label — encoder, legacy emission - ValueType.from_live_stat_label — decoder, accepts either form and document the "stats." prefix as the legacy convention so the removeprefix branch reads as historical compatibility, not generic parsing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ture
The window-stat gauge patterns lagged behind what legacy accelerator
plugins actually publish:
- *_power: legacy emits both stats.max and stats.avg; this PR was
emitting only stats.max.
- *_temperature: legacy emits both stats.max and stats.avg; this PR
was emitting neither.
Surveyed plugins (rebellions/common, rebellions/atom_max, habana, ipu,
mock):
- All emit *_mem (max only) and *_util (max + avg).
- Only mock currently emits *_power and *_temperature, both with
{avg, max} filters.
Extend STATS_MAX_GAUGE_METRIC_PATTERNS to include _temperature and
STATS_AVG_GAUGE_METRIC_PATTERNS to include _power and _temperature so
the new pipeline matches what every legacy plugin actually publishes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…t_rx/net_tx
Legacy live_stat consumers — most importantly the
legacy_compute_session.net_rx_bytes / net_tx_bytes / io_read_bytes /
io_write_bytes GraphQL resolvers — read the "stats.rate" key from the
per-metric live_stat dict. The new PromQL pipeline previously emitted
nothing under that label, leaving those four legacy fields uncovered:
- io_read / io_write: legacy agent's stats_filter={"rate"}, which the
new pipeline did not produce.
- net_rx / net_tx: legacy agent's stats_filter is empty, so the agent
publishes no stats.rate at all even though the legacy resolver
expects it — making the resolver always return 0. The new pipeline
now produces a value where legacy never did.
Two metric shapes flow through the new bucket:
- Gauge-shape (net_rx, net_tx): the metric's `current` value is
already a per-second rate (set by agent's
current_hook = lambda m: m.stats.rate), so PromQL only needs to
sum across replicas and label_replace to "stats.rate".
- Counter-shape (io_read, io_write): the value is a cumulative byte
counter, so PromQL applies rate(...[window]) before label_replace.
Live verified against Prometheus on a running kernel:
{net_rx, stats.rate} = 27530
{net_tx, stats.rate} = 30378
{io_read, stats.rate} = 0
{io_write, stats.rate} = 0
Re-introduces ValueType.RATE (and its to_legacy_live_stat_label /
from_live_stat_label round-trip) that was removed earlier in the
branch when no producer existed; the round-trip is now load-bearing
again.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…bels
Earlier the new pipeline emitted "stats.max" / "stats.avg" / "stats.rate"
as the rewritten value_type label, mirroring the legacy live_stat dict
keys. That coupling was unnecessary: the labels are synthesized at
query time via label_replace and never persist in the Prometheus TSDB
(only "current" / "capacity" series actually get scraped). The legacy
"stats." prefix is a Valkey live_stat artifact that future legacy-
compat consumers should re-attach in a dedicated converter, not
something the manager pipeline needs to carry.
The new pipeline now emits clean, ValueType-enum-aligned labels:
stats.max -> max
stats.avg -> avg
stats.rate -> rate
Round-trip helpers go away:
- ValueType.to_legacy_live_stat_label() removed (was a 1-arm match
that always returned "stats.{value}" for the three stat slots).
- ValueType.from_live_stat_label() removed; KernelMetricValuesByKernel
parses the response label via ValueType(value_type_str) directly.
Drop trailing .value on f-string interpolation of ValueType members —
StrEnum's __format__ already returns the string value, so
f"{ValueType.MAX}" == "max" without the explicit attribute access.
Test fixtures and the from_prometheus_response characterisation test
now use the bare labels.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The original phrasing led with the implementation ("window-based max/avg
queries via PromQL") rather than the user-facing motivation ("a
parallel supply for legacy stats.* live_stat fields that doesn't suffer
the agent-restart / accumulator failure modes"). Updates the changelog
entry to lead with the compat angle, expands coverage to include
stats.rate (now also produced), and notes the restart-safe / window
semantics that make this a meaningful upgrade over the legacy producer.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
5 tasks
…oint Derive STATS_MAX_GAUGE_METRIC_PATTERNS / STATS_AVG_GAUGE_METRIC_PATTERNS from a single _ACCEL_GAUGE_SUFFIXES_* source so adding a new accelerator metric kind (e.g. clock, voltage) is a one-suffix edit instead of editing two regex strings in lockstep. Functionally identical — the generated regex is the same alternation, just sorted alphabetically. Snapshot tests adjusted for the new sort order. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
The legacy
live_statresolvers —mem_max_bytes,cpu_using,io_max_scratch_size,net_rx_bytes,net_tx_bytes, and friends — readstats.max/stats.avg/stats.ratekeys produced by an in-memoryMovingStatisticsaccumulator on the agent. Each running container's metric carries aMovingStatisticsinstance holding running per-sample max / sum / count over every sample observed (sampled every 5 s viaUTILIZATION_METRIC_INTERVAL = 5.0), plus the latest 2 samples for rate calculation. Each metric declares astats_filterwhitelist controlling which slots get published.That accumulator has serious failure modes:
stats.maxsilently rewinds to "max since the agent process happened to last restart" — not lifetime peak.avgflattens forever._countis unbounded; a session that ran hot then went idle drifts toward the idle reading.stats.rateis computed from the last two samples only, so the first post-restart reading is0.net_rx/net_txthe agent'sstats_filteris empty, so the agent never actually publishesstats.rate— the legacy GraphQLnet_rx_bytes/net_tx_bytesresolvers always return0.What this PR provides
A parallel manager-side supply for the same
stats.{max,avg,rate}slots, computed on demand from Prometheus via window queries (MetricConfig.timewindow, default 5m). The legacy agent producer stays in place — this is a drop-in alternative source that future legacy compat code can route to.The conceptual switch is "lifetime peak as remembered by the agent" → "recent peak as recorded by Prometheus." The new supply:
net_rx/net_txstats.ratehole that the agent's emptystats_filterleftThe query results carry clean
value_type=max/value_type=avg/value_type=ratelabels (without the legacystats.prefix). A futureLegacyLiveStatConverterwill re-attach thestats.prefix when serving legacy clients; the manager pipeline itself doesn't need to carry the legacy encoding.Coverage matrix (what each pipeline publishes)
stats.maxstats.maxstats.avgstats.avgstats.ratestats.ratecpu_utilmemio_scratch_sizeio_readio_writenet_rxnet_tx*_mem*_util*_power*_temperatureNotes:
stats.ratefornet_rx/net_txis broken — always returns0. The agent's measurement entries for these omitstats_filter(default empty), so the agent never publishesstats.rateto Valkey for net_rx/net_tx. The legacy GraphQLnet_rx_bytes/net_tx_bytesresolvers read the missing key and fall back to0. The agent's per-second rate IS published, but ascurrent(viacurrent_hook = lambda m: m.stats.rate), which the legacy resolver doesn't read — an internal contract mismatch.stats.ratefornet_rx/net_tx— passthrough, norate()wrap. Thecurrentseries in Prometheus already carries per-second rate (thanks to the agent'scurrent_hook), so the new query just sums across replicas andlabel_replaces thevalue_typelabel fromcurrenttorate. Applying PromQLrate()to an already-rate series would compute rate-of-rate and produce nonsense.stats.rateforio_read/io_write—rate(...[5m])over the counter. Nocurrent_hookhere, socurrentis the cumulative byte counter. The new query applies PromQLrate(...[5m])to convert it to per-second bytes, thenlabel_replaces the result torate. Conceptually matches what the legacystats_filter={"rate"}slot held, except computed over a 5-minute window instead of the agent's last-2-sample delta.Live verification
Manager restarted on this branch, halfstack with Prometheus 3.1.0, one idle running kernel (1 CPU / 1 GiB):
memcpu_utilio_scratch_sizenet_rxnet_txio_readio_writeIdle workload across the 5-minute window →
current/stats.max/stats.avgconverge. Same kernel via the legacy GraphQL resolver returnedlive_stat: null(Valkey path empty in this env), confirming all values above came from the new pipeline.Bug fixed during verification
Prometheus's RE2 engine rejects
\-asparse error: unknown escape sequence U+002D '-'. Python'sre.escapeover-escapes hyphens, so every kernel-live-stat query (UUIDs always contain hyphens) would have failed to parse:Fixed by stripping the redundant escape before sending.
Test plan
pants test tests/unit/manager/services/utilization_metric/test_container_metric.pypants fmt fix lint check --changed-since=HEAD~1stats.max/stats.avg/stats.ratepopulated as expected formem/cpu_util/io_scratch_size/net_rx/net_tx/io_read/io_write. Wiring the new pipeline behind the legacy GraphQL resolvers (theLegacyLiveStatConverter) is the follow-up.🤖 Generated with Claude Code